Clustering

Data Preparation

To perform a cluster analysis in R, generally, the data should be prepared as follows:

  1. Rows are observations (individuals) and columns are variables.

  2. Any missing value in the data must be removed or estimated.

  3. The data must be standardized (i.e., scaled) to make variables comparable.
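As a minimal sketch of steps 2 and 3 (using a toy data frame as a stand-in, since the raw data is not shown here):

```r
# Toy data frame standing in for the raw data (rows = observations).
df <- data.frame(a = c(1, 2, NA, 4), b = c(10, 20, 30, 40))

# Step 2: remove rows with missing values (alternatively, impute them).
df <- na.omit(df)

# Step 3: standardize (center each variable to mean 0, scale to sd 1).
df_scaled <- scale(df)
```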

I group the data by medium & year to compute the mean value for each p_group:
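That grouping step can be sketched with dplyr and tidyr; the column names here (medium, year, p_group, value) are illustrative stand-ins, since the real variable names are not shown:

```r
library(dplyr)
library(tidyr)

# Toy long-format data; column names are hypothetical stand-ins.
dat <- data.frame(
  medium  = c("Bild", "Bild", "SZ", "SZ"),
  year    = c(2001, 2001, 2001, 2001),
  p_group = c("SPD", "SPD", "SPD", "FDP"),
  value   = c(-0.10, -0.06, -0.08, -0.05)
)

# Mean value per medium/year for each p_group, then one column per party.
df_wide <- dat %>%
  group_by(medium, year, p_group) %>%
  summarise(mean_value = mean(value), .groups = "drop") %>%
  pivot_wider(names_from = p_group, values_from = mean_value)
```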

Unweighted DF

| Medium | Bündnis 90/ Die Grüne | CDU/CSU | FDP | Linke/PDS/WASG | SPD |
|---|---|---|---|---|---|
| BamS | -0.084 | -0.023 | -0.037 | -0.171 | -0.081 |
| Bericht aus Berlin | -0.111 | -0.115 | -0.153 | -0.148 | -0.125 |
| Berlin direkt | -0.099 | -0.085 | -0.089 | -0.158 | -0.098 |
| Berliner | -0.065 | -0.088 | -0.062 | -0.082 | -0.077 |
| Bild | -0.105 | -0.048 | -0.043 | -0.161 | -0.085 |
| Die Welt | -0.101 | -0.047 | -0.039 | -0.124 | -0.096 |
| Die Woche | -0.107 | -0.12 | -0.078 | -0.046 | -0.06 |
| Die Zeit | -0.11 | -0.101 | -0.121 | -0.083 | -0.093 |
| F.A.S. | -0.062 | -0.04 | -0.059 | -0.115 | -0.08 |
| F.A.Z. | -0.077 | -0.041 | -0.034 | -0.076 | -0.074 |
| Fakt | -0.118 | -0.104 | -0.245 | -0.164 | -0.215 |
| Focus | -0.118 | -0.052 | -0.049 | -0.153 | -0.11 |
| Fr. Rundschau | -0.066 | -0.102 | -0.083 | -0.08 | -0.078 |
| Frontal 21 | -0.098 | -0.187 | -0.178 | -0.091 | -0.19 |
| heute | -0.049 | -0.07 | -0.071 | -0.077 | -0.062 |
| heute journal | -0.055 | -0.085 | -0.09 | -0.105 | -0.07 |
| Kontraste | -0.111 | -0.184 | -0.155 | -0.123 | -0.204 |
| Monitor | -0.158 | -0.238 | -0.25 | -0.172 | -0.148 |
| Panorama | -0.106 | -0.182 | -0.22 | -0.37 | -0.201 |
| Plusminus | -0.201 | -0.154 | -0.226 | 0.042 | -0.168 |
| ProSieben | -0.098 | -0.07 | -0.064 | -0.1 | -0.074 |
| Report (BR) | -0.139 | -0.133 | -0.117 | -0.208 | -0.233 |
| Report (SWR) | -0.086 | -0.223 | -0.348 | -0.18 | -0.172 |
| Rh. Merkur | -0.128 | -0.046 | -0.053 | -0.129 | -0.118 |
| RTL Aktuell | -0.075 | -0.072 | -0.082 | -0.075 | -0.064 |
| Sat.1 News | -0.094 | -0.058 | -0.04 | -0.124 | -0.078 |
| Spiegel | -0.066 | -0.089 | -0.098 | -0.092 | -0.064 |
| Stern | -0.064 | -0.088 | -0.063 | -0.026 | -0.087 |
| Super Illu | -0.208 | -0.051 | -0.096 | -0.062 | -0.135 |
| SZ | -0.063 | -0.088 | -0.066 | -0.086 | -0.077 |
| Tagesschau | -0.055 | -0.08 | -0.071 | -0.075 | -0.063 |
| Tagesthemen | -0.071 | -0.091 | -0.097 | -0.082 | -0.075 |
| tageszeitung | -0.071 | -0.124 | -0.098 | -0.091 | -0.09 |
| WamS | -0.097 | -0.035 | -0.041 | -0.122 | -0.119 |
| WISO | -0.134 | -0.074 | -0.107 | 0.026 | -0.086 |

Weighted DF

| Medium | Bündnis 90/ Die Grüne | CDU/CSU | FDP | Linke/PDS/WASG | SPD |
|---|---|---|---|---|---|
| BamS | -0.008 | -0.011 | -0.005 | -0.003 | -0.023 |
| Bericht aus Berlin | -0.01 | -0.048 | -0.026 | -0.012 | -0.03 |
| Berlin direkt | -0.009 | -0.037 | -0.012 | -0.01 | -0.026 |
| Berliner | -0.012 | -0.03 | -0.005 | -0.005 | -0.025 |
| Bild | -0.009 | -0.022 | -0.005 | -0.005 | -0.028 |
| Die Welt | -0.013 | -0.019 | -0.004 | -0.005 | -0.031 |
| Die Woche | -0.021 | -0.042 | -0.006 | -0.003 | -0.019 |
| Die Zeit | -0.017 | -0.036 | -0.007 | -0.004 | -0.036 |
| F.A.S. | -0.008 | -0.016 | -0.005 | -0.005 | -0.027 |
| F.A.Z. | -0.01 | -0.016 | -0.003 | -0.004 | -0.024 |
| Fakt | -0.013 | -0.033 | -0.014 | -0.016 | -0.09 |
| Focus | -0.012 | -0.023 | -0.005 | -0.005 | -0.034 |
| Fr. Rundschau | -0.011 | -0.035 | -0.006 | -0.004 | -0.028 |
| Frontal 21 | -0.008 | -0.088 | -0.023 | -0.005 | -0.05 |
| heute | -0.006 | -0.03 | -0.008 | -0.003 | -0.019 |
| heute journal | -0.006 | -0.038 | -0.01 | -0.004 | -0.022 |
| Kontraste | -0.012 | -0.067 | -0.012 | -0.012 | -0.071 |
| Monitor | -0.016 | -0.102 | -0.03 | -0.003 | -0.049 |
| Panorama | -0.006 | -0.078 | -0.039 | -0.012 | -0.062 |
| Plusminus | -0.016 | -0.058 | -0.037 | 0 | -0.064 |
| ProSieben | -0.013 | -0.028 | -0.004 | -0.002 | -0.028 |
| Report (BR) | -0.017 | -0.054 | -0.008 | -0.009 | -0.083 |
| Report (SWR) | -0.005 | -0.101 | -0.041 | -0.007 | -0.057 |
| Rh. Merkur | -0.018 | -0.019 | -0.004 | -0.005 | -0.038 |
| RTL Aktuell | -0.007 | -0.033 | -0.007 | -0.002 | -0.022 |
| Sat.1 News | -0.01 | -0.025 | -0.002 | -0.004 | -0.03 |
| Spiegel | -0.007 | -0.037 | -0.008 | -0.004 | -0.023 |
| Stern | -0.008 | -0.035 | -0.005 | -0.001 | -0.032 |
| Super Illu | -0.011 | -0.02 | -0.005 | -0.013 | -0.04 |
| SZ | -0.01 | -0.034 | -0.005 | -0.004 | -0.026 |
| Tagesschau | -0.006 | -0.034 | -0.008 | -0.003 | -0.019 |
| Tagesthemen | -0.008 | -0.039 | -0.011 | -0.003 | -0.023 |
| tageszeitung | -0.017 | -0.039 | -0.006 | -0.006 | -0.029 |
| WamS | -0.01 | -0.015 | -0.004 | -0.004 | -0.04 |
| WISO | -0.011 | -0.029 | -0.013 | 0.002 | -0.028 |

K-Means Clustering

  • The most commonly used unsupervised ML algorithm for partitioning a given data set into a set of k clusters, where k represents the number of pre-specified groups.

  • It classifies objects into multiple clusters, where each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.

  • k-means clustering consists of defining clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized.

  • The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:

\[ W(C_k)=\sum_{x_i\in C_k}(x_i-\mu_k)^2 \] where:

  • \(x_i\) is a data point belonging to Cluster \(C_k\)
  • \(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\)

Each observation (\(x_i\)) is assigned to a given cluster such that the sum of squared distances of the observation to its assigned cluster center (\(\mu_k\)) is minimized.
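For a concrete sense of \(W(C_k)\), it can be computed directly for a toy cluster:

```r
# Toy cluster C_k: three observations on two variables.
Ck <- rbind(c(1, 2), c(3, 4), c(5, 6))

mu_k <- colMeans(Ck)               # centroid: mean of each variable
W_Ck <- sum(sweep(Ck, 2, mu_k)^2)  # sum of squared deviations from centroid
W_Ck
# 16: each variable contributes (-2)^2 + 0^2 + 2^2 = 8
```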

The objective function to be minimized is the total within-cluster sum of squares:

\[ \text{tot.withinss} = \sum_{k=1}^{K}W(C_k)=\sum_{k=1}^{K}\sum_{x_i\in C_k}(x_i-\mu_k)^2 \]

### K-means Algorithm

The K-means algorithm can be summarized as follows:

  1. Specify the number of clusters (K) to be created (by the analyst).

  2. Randomly select k objects from the data set as the initial cluster centers (means).

  3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.

  4. For each of the k clusters, update the cluster centroid by calculating the new mean value of all the data points in the cluster. The centroid of the \(k\)th cluster is a vector of length \(p\) containing the means of all variables for the observations in that cluster; \(p\) is the number of variables.

  5. Iteratively minimize the total within sum of square (Equation above). That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached.
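In R, the whole loop above is handled by the base kmeans() function; a sketch on toy standardized data (a stand-in, since the real data frame's name is not shown in this section):

```r
set.seed(123)  # step 2 picks random starting centers, so fix the seed

# Toy standardized data standing in for the prepared data frame.
x <- scale(matrix(rnorm(200), ncol = 4))

# Step 1: choose k = 3; steps 2-5 run inside kmeans().
# nstart = 25 repeats the random start 25 times and keeps the best solution.
km <- kmeans(x, centers = 3, nstart = 25)

km$size     # number of observations per cluster
km$cluster  # cluster assignment (1:3) for each observation
```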

The output of kmeans() is a list with several components. The most important are:

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centers.
  • totss: The total sum of squares.
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e. \(totss-tot.withinss\).
  • size: The number of points in each cluster.
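These components satisfy the identities described above, which can be checked on any fitted object (here fitted on the built-in mtcars data as a stand-in):

```r
set.seed(1)
km <- kmeans(scale(mtcars), centers = 3, nstart = 25)

# tot.withinss is the sum of the per-cluster withinss values,
# and betweenss is totss minus tot.withinss.
c(isTRUE(all.equal(km$tot.withinss, sum(km$withinss))),
  isTRUE(all.equal(km$betweenss, km$totss - km$tot.withinss)))
```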

If we print the results, we'll see that our groupings resulted in three clusters of sizes 29, 49, and 287. We see the cluster centers (means) for the three groups across the four variables (Bündnis 90/ Die Grüne, CDU/CSU, FDP, SPD). We also get the cluster assignment for each observation (e.g., BamS was assigned to cluster 3 in 2001, Bericht aus Berlin was assigned to cluster 1 in 2005, etc.).

We can also view our results using fviz_cluster, which provides a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.
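A sketch of that call, using the factoextra package and the built-in USArrests data as a stand-in for the prepared data frame:

```r
library(factoextra)  # provides fviz_cluster()

set.seed(123)
x  <- scale(USArrests)  # 4 variables, so the plot projects onto 2 PCs
km <- kmeans(x, centers = 3, nstart = 25)

# Plots the points on the first two principal components, colored by cluster.
p <- fviz_cluster(km, data = x)
```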

Unweighted

Weighted